-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
WIP: Dict Array Extension #29557
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
WIP: Dict Array Extension #29557
Conversation
Little bit of a wet blanket, but... why? nested objects are among the corneriest of corner cases; i would expect we'd want to discourage them. Or is the idea to stuff all of the corner stuff into a couple of EAs? |
Right this would be a way to make things less thorny. Nested record types are pretty common with DBMS nowadays so this would mirror some aspects of that.
…Sent from my iPhone
On Nov 11, 2019, at 4:16 PM, jbrockmendel ***@***.***> wrote:
Little bit of a wet blanket, but... why? nested objects are among the corneriest of corner cases; i would expect we'd want to discourage them. Or is the idea to stuff all of the corner stuff into a couple of EAs?
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub, or unsubscribe.
|
|
||
def __setitem__(self, key: Hashable, value: Any): | ||
def check_value(value): | ||
if not (pd.isnull(value) or isinstance(value, dict)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could potentially make this pretty class pretty generic adding an abstract class for all Container types then leave it to subclasses for a few minor things to clarify list vs set vs dict
raise ValueError( | ||
"DictArray requires a sequence of dicts. Got " | ||
"'{}' dtype instead.".format(self._ndarray.dtype) | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to extend this validation for object dtypes on construction that contain more than just dicts...
Question about the scope of this DictArray: in general, I think there are two possible cases: fixed keys and variable keys. In Arrow terminology, this is called StructArray (fixed set of keys in each record) or MapArray (variable set of keys in each record). Those are regarded separately, because they are stored in memory quite differently (the StructArray stores one contiguous array of values for each key per column and can have different types for the values of the different keys, while MapArray requires all values to be of the same type and stores them in one single array with offsets, and also the keys in one contiguous array). Also Spark makes the same distinction. A python dict is more close to the map type (but also with variable types for the values), but just wondering to make the specification explicit. |
A bit more on scope / names: postgres has a JSON data type that basically combines nested data support for lists and mappings. Would we want that, or is there value in separating lists from mappings at a dtype level? If we split lists and dicts into their own extension dtypes, I'd vote to call this a MappingArray rather than a DictArray. It's a bit more general, and I think at some point we'll have a "categorical without fixed categories" array, which pyarrow calls a dictionary encoded array. I'm not sure we would call that a DictEncodedArray, but just in case. |
yeah I think we would be well served by the distinct types that @TomAugspurger indicates a nested type that holds dicts, lists, scalars and a DictEncoded type that backs Categorical and String a pure Dict type is fine for testing |
Joris - that’s a pretty interesting thought. Yea this would align more closely to the mapping type (at least at the outset I think easier to implement) but I could see the advantage of both implementations over time.
Tom - I guess the downside to a Mapping type is that it by definition wouldn’t support assignment so we’d end up with a MutableMapping array as well or just not care and allow assignment to Mapping. I don’t know that is better or more intuitive to an end user, but maybe.
Jeff might be misunderstanding your response but are you in favor of the JSON type? I think ensuring JSON semantics for an extensionarray is probably too strict (I.e. object keys have to be strings). Though if not specific to JSON we could think about an abstract container extension that a Dict, list, set, etc... extension inherits from while adding support for methods specific to those objects.
…Sent from my iPhone
On Nov 12, 2019, at 4:09 AM, Jeff Reback ***@***.***> wrote:
yeah I think we would be well served by the distinct types that @TomAugspurger indicates a nested type that holds dicts, lists, scalars and a DictEncoded type that backs Categorical and String
a pure Dict type is fine for testing
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub, or unsubscribe.
|
Being a mapping doesn't mean you're immutable. It just doesn't guarantee that you are mutable, if that makes sense :) In [3]: import collections.abc
In [4]: isinstance(dict(), collections.abc.Mapping)
Out[4]: True |
Yea should have qualified that as “always” support assignmnent. I would think that 99% of “mapping use cases” in Python just use Dict (totally made up stat) so I would be hesitant to introduce an extension type that may not explicitly allow that
…Sent from my iPhone
> On Nov 12, 2019, at 7:22 AM, Tom Augspurger ***@***.***> wrote:
downside to a Mapping type is that it by definition wouldn’t support assignment so we’d end up with a MutableMapping array as well or just not care and allow assignment to Mapping.
Being a mapping doesn't mean you're immutable. It just doesn't guarantee that you are mutable, if that makes sense :)
In [3]: import collections.abc
In [4]: isinstance(dict(), collections.abc.Mapping)
Out[4]: True
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub, or unsubscribe.
|
But do we want the mapping elements to be mutable? Should something like In general, IMO we should think a bit more about use cases for those different options. What do other languages do? What kind of types are users looking for? Putting dicts in an object array is certainly the easiest to get a basic array working, and might be the closest to the objects available in Python, but it doesn't necessarily give what many users might need / want (regarding mutability of values, expanding in place with new keys, variable key and value types, ...). Eg if we want an optimized implementation later, those requirements might be very hard to implement. |
tight on time recently...may come back |
modeled after some of @TomAugspurger great work in #27949 here is a dirty implementation of a DictArray. Marking as a WIP as still need to work through all of the remaining extension array types, though figure can put out there for feedback if anyone cares in the meantime
Definitely some overlap with the existing JSONArray; could integrate some aspects of that into this or vice versa